Exploratory data analysis is the process of exploring a dataset without prior assumptions about its content. The dataset analysed here is maintained by the Center for Policing Equity and contains 2383 incident records reported and stored by the Dallas Police Department, Texas. This report analyses the available information in depth to detect patterns and, where present, biases in the data.
The data covers incidents in which police officers used force on subjects. It records the race and gender of the subjects and the officers, the time and location of each incident, whether an injury occurred and, if so, its type, the reasons force was used and the types of force used. The report goes through each of these areas in turn, with visualizations where they help.
Race and Gender of Subjects and Officers
The data provides race classifications for both officers and subjects. The proportion of each class is shown in the two tables below, officers first and then subjects.

| Officer race | Proportion |
|---|---|
| American Ind | 0.0033571 |
| Asian | 0.0230802 |
| Black | 0.1430969 |
| Hispanic | 0.2022661 |
| Other | 0.0113303 |
| White | 0.6168695 |

| Subject race | Proportion |
|---|---|
| American Ind | 0.0004196 |
| Asian | 0.0020982 |
| Black | 0.5593789 |
| Hispanic | 0.2198909 |
| Other | 0.0209820 |
| White | 0.1972304 |
From the tables it is clear that the classes are highly imbalanced. Among subjects, Black and Hispanic people together make up about 78% of cases against about 20% for White subjects, while among officers the pattern is reversed, with about 62% being White. The gender imbalance follows a similar pattern, with males far outnumbering all other genders. Of the subjects, 440 are female (18.5%) and 1932 are male (81%). Among officers, females form about 10% and males the remaining ~90%. The gender-by-race mosaic plots show the same pattern.
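As a rough illustration of how such proportion tables and percentages can be derived, here is a minimal sketch in plain Python. The report's own output appears to come from R, so the function name `proportions` and the toy list are assumptions made for illustration only. The gender counts are the ones quoted in the text.

```python
from collections import Counter


def proportions(labels):
    """Return each category's share of the total, like a proportion table."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Hypothetical toy column, not the real data.
toy_gender = ["Male", "Male", "Male", "Female"]
print(proportions(toy_gender))

# Gender shares recomputed from the counts reported in the text.
total_subjects = 2383
female_share = 440 / total_subjects
male_share = 1932 / total_subjects
print(f"female: {female_share:.1%}, male: {male_share:.1%}")
```

Working from counts rather than pre-rounded proportions keeps the percentages consistent with the totals.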
The largest tile in the mosaic plot for officers supports the statements above: White male officers are the largest group (1336), and Black males form the similarly large tile in the subjects plot.
We next explore the relationship between the subject's race and whether the subject was arrested. The data records arrest in binary form, that is, yes if the subject was arrested and no otherwise. To check for an association between the two variables, we perform Pearson's Chi-squared test.
##
## Pearson's Chi-squared test
##
## data: data$SUBJECT_RACE and data$SUBJECT_WAS_ARRESTED
## X-squared = 15.777, df = 5, p-value = 0.007512
The test gives a p-value of 0.007512, which is well below the chosen significance level of 0.05. This indicates that the two variables are not independent. The test itself does not say which group drives the association, but given the imbalance among classes, a prediction based on these numbers would single out a “Black male” as more likely to be arrested.
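To make the test above concrete, the following is a minimal sketch of Pearson's Chi-squared test of independence in plain Python. It is not the code behind the R output shown earlier. The function names and the toy contingency table are assumptions for illustration, and no continuity correction is applied. The last line feeds the reported statistic and degrees of freedom back in so the result can be compared with the reported p-value of 0.007512.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-squared variable.

    Uses the closed-form recurrence that holds for integer df >= 1.
    """
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_independence(table):
    """Return (statistic, df, p_value) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical counts: rows are groups, columns are arrested yes / no.
toy_table = [[30, 10], [20, 20], [10, 30]]
print(chi_square_independence(toy_table))

# Reported statistic and degrees of freedom from the output above.
print(round(chi2_sf(15.777, 5), 4))
```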
Officer years on force
Next we look at the number of years officers have been on the job to see whether experience affects whether subjects are arrested. Experience ranges from 0 to over 30 years. We use a point plot of counts against officers' years on force, coloured by whether the subject was arrested.
Although no strict boundaries can be drawn from the plot, arrests are clearly more frequent for less experienced officers in their early years of employment, and the counts fall as the number of years increases. The highest count is almost 300 arrests, made by officers with two years of experience.
Associations between various variables
On this note, we look at the associations between the various labels (variables) to see whether there are any more such dependencies. For this we use a correlation plot. The variables are nominal categorical data, that is, data that are not numbers but take several unordered values. For example, types of cats are Persian, British Shorthair and so on: the types have no order, they are just different. For such data Goodman and Kruskal's tau is an appropriate measure (links for further reading are given in the references at the end of the document). The analysis gives us the following correlation plot. Higher values and more distended circles show stronger associations. Bear in mind that these are associations and not causal relationships.
The most obvious feature of the plot is the largest value, 0.83, from “SUBJECT_OFFENSE” to “SUBJECT_WAS_ARRESTED”. This forward association suggests that if we know the subject's offense, it is easy to predict whether the subject will be arrested. The reverse association, that is, determining the offense from whether the subject was arrested, is weak (value = 0.1), so that prediction is not reliable. Other notable, though weaker, associations are those between “SUBJECT_OFFENSE” and “SUBJECT_DESCRIPTION” (0.38), “SUBJECT_OFFENSE” and “REASON_FOR_FORCE” (0.34), and “SUBJECT_OFFENSE” and “INCIDENT_REASON” (0.32). [6]
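The asymmetry described above follows directly from how Goodman and Kruskal's tau is defined: it measures how much the variability of one variable shrinks once the other is known. The sketch below computes it in plain Python. The function name and the toy offense and arrest labels are hypothetical, and this is not the implementation used to draw the plot.

```python
from collections import Counter


def goodman_kruskal_tau(x, y):
    """Goodman and Kruskal's tau for predicting y from x (asymmetric)."""
    n = len(x)
    joint = Counter(zip(x, y))
    x_counts = Counter(x)
    y_counts = Counter(y)
    # Variability of y on its own.
    v_y = 1.0 - sum((c / n) ** 2 for c in y_counts.values())
    if v_y == 0.0:
        raise ValueError("y has a single category, tau is undefined")
    # Expected variability of y once x is known.
    v_y_given_x = 1.0 - sum(
        (c / n) ** 2 / (x_counts[xi] / n) for (xi, _), c in joint.items()
    )
    return (v_y - v_y_given_x) / v_y


# Hypothetical toy data: each offense fully determines the arrest outcome.
offense = ["theft", "assault", "trespass", "dui"] * 5
arrested = ["yes", "yes", "no", "no"] * 5

print(goodman_kruskal_tau(offense, arrested))  # offense -> arrest
print(goodman_kruskal_tau(arrested, offense))  # arrest -> offense
```

In this toy case the offense predicts the arrest outcome perfectly, while the arrest outcome only narrows the offense down to two candidates, so the two directions give different values.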
Subject descriptions
We now look at the descriptions of the subjects as given by the officers who responded to the situation. There are 14 distinct descriptions in total, including three relating to drugs or alcohol. A column graph of the counts is shown below.
The highest count is for the description “Unknown”, with 440 cases, followed by “Mentally Unstable” with 412 cases. We then do a simple analysis to see how many subjects described as mentally unstable were injured by the use of force. There are 119 such cases. We also see how many of the same subjects were restrained by a police officer displaying a weapon, which could be a baton, a taser or a gun. There are 22 such cases. Looking at injuries overall, 26% of all subjects were reported to have been injured by the use of force.
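The filter-and-count steps above can be sketched as follows in plain Python. `SUBJECT_DESCRIPTION` is a column name that appears in this report, while `SUBJECT_INJURY`, the helper `count_where` and the toy rows are assumptions made for illustration. The final line derives the injured share among subjects described as mentally unstable from the counts quoted in the text.

```python
def count_where(rows, **conditions):
    """Count rows (dicts) whose fields equal every given condition."""
    return sum(
        all(row.get(field) == value for field, value in conditions.items())
        for row in rows
    )


# Hypothetical toy rows, not the real data.
toy_rows = [
    {"SUBJECT_DESCRIPTION": "Mentally unstable", "SUBJECT_INJURY": "Yes"},
    {"SUBJECT_DESCRIPTION": "Mentally unstable", "SUBJECT_INJURY": "No"},
    {"SUBJECT_DESCRIPTION": "Unknown", "SUBJECT_INJURY": "Yes"},
]

unstable = count_where(toy_rows, SUBJECT_DESCRIPTION="Mentally unstable")
unstable_injured = count_where(
    toy_rows, SUBJECT_DESCRIPTION="Mentally unstable", SUBJECT_INJURY="Yes"
)
print(unstable, unstable_injured)

# Share derived from the reported counts: 119 injured out of 412.
reported_share = 119 / 412
print(f"{reported_share:.1%}")
```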
Who is more likely to be pulled over?
Filtering the data for instances where subjects were pulled over by the police, we want to see whether any race is more likely to be pulled over than others. The following plot shows this.
If we read the graph as a prediction of who is more likely to be pulled over, the answer is a Black male, followed by a Hispanic male. The data is certainly skewed and imbalanced: Black males account for 34 of the 93 traffic stops.
Time series analysis: Hours, Days and Months
Next we analyse how incident counts vary over time, both across the 24 hours of the day and across days and months. The data covers only the year 2016, so the analysis is limited to that year. The heatmap below shows the distribution of incidents across the days of the month and the 12 months of the year.
The heatmap shows no clearly defined bands, but both darker tiles and missing tiles are more frequent in the latter half of the year, and the darker tiles are continuous in December. One curious detail is that 14 February, Valentine's Day, is among the days with the highest counts: a high number of incidents on a day dedicated to affection!
Now a similar look across the 24 hours of the day. At what times do incidents occur more often? Do the police use force away from the public eye? The animated plot below takes us through the timeline.
The line graph shows that cases where force was used are more frequent after dusk and in the early hours of the morning. However, no sharp boundaries can be drawn from the graph.
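The hourly counts behind a line graph like this one can be obtained by binning incident times by hour. The sketch below does this in plain Python. The 12-hour time format, the function name and the toy times are assumptions, since the report does not show how the time column is stored.

```python
from datetime import datetime


def incidents_per_hour(times, fmt="%I:%M:%S %p"):
    """Return a list of 24 counts, one per hour of the day."""
    counts = [0] * 24
    for text in times:
        counts[datetime.strptime(text, fmt).hour] += 1
    return counts


# Hypothetical toy times, not the real data.
toy_times = ["11:45:00 PM", "1:10:00 AM", "11:05:00 PM", "2:30:00 PM"]
hourly = incidents_per_hour(toy_times)
busiest_hour = max(range(24), key=hourly.__getitem__)
print(hourly)
print(busiest_hour)
```

Keeping all 24 bins, including empty ones, makes quiet hours visible in a plot instead of silently dropping them.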
Geographical analysis: Where are these incidents occurring?
The data gives the geolocation of each instance, including the division, sector, beat, latitude and longitude. The city of Dallas is divided into 7 divisions, namely Central, North Central, Northeast, Northwest, South Central, Southeast and Southwest, and each division has 5 sectors. Plotting the data by division and sector gives the following graph.
The graph shows that Central has the highest number of instances of force used and that, within Central, Sector 130 has the highest count, with 259 cases. The proportion table below shows that Central accounts for about 24% of all cases.
| Division | Proportion |
|---|---|
| CENTRAL | 0.2362568 |
| NORTH CENTRAL | 0.1338649 |
| NORTHEAST | 0.1430969 |
| NORTHWEST | 0.0801511 |
| SOUTH CENTRAL | 0.1300881 |
| SOUTHEAST | 0.1519094 |
| SOUTHWEST | 0.1246328 |
Among the divisions, we would like to see which has the largest drug problem. This can be obtained from the subject descriptions, which include values such as “Unknown Drugs”, “Marijuana” and “Alcohol and unknown drugs”. The analysis reveals 648 drug-related cases, which is 27.2% of all cases in the data. Checking which divisions have the largest drug problem gives the following point plot.
Central has the highest values for Unknown Drugs and for Alcohol and Unknown Drugs, but a low value for Marijuana. Because of its very large lead over the other divisions in Unknown Drugs, Central still has the largest drug problem overall, with 177 cases, i.e. 27% of drug-related cases. Northwest has the fewest, accounting for only 7%.
We also check whether there is a race issue in the drug data. Looking at the proportions, “Black” accounts for roughly 57% of the drug-related cases (368 of 648). Among these we check for cases where the subject was arrested and cases where the subject was injured during the procedure. Of the 368 cases, 321 subjects were arrested and 91 were injured.
For further geographical analysis, we plot the locations, that is, the latitude and longitude available in the data, onto a map.
An obvious feature of the map is the clustering around the central portion, which is where the Central division is located. Hovering over a data point also shows the offense it represents.
DISCUSSION
The imbalance in the race and gender distributions of subjects and officers is concerning for two main reasons. First, these imbalances could possibly point towards racial profiling by the Dallas Police Department, Texas. However, since the data covers only one year, it is not enough to support that conclusion. Second, if the data were ever used for predicting crimes (predictive policing), the model would rate, for example, a Black or Hispanic male as more likely to commit a drug-related crime, and such a conclusion is damaging on several levels. It could lead to biased decisions by arbiters of justice, under the false notion that the predicting system is objective. This is simply not true, as machines unfortunately tend to inherit human biases (see the references for more on this) [4].
The analysis of subject descriptions given by police officers shows a large population described as “Mentally unstable”, and further analysis shows that some of them were injured or threatened by the display of a weapon. The use of force by the police is open to criticism in any situation. Even allowing that it may sometimes be necessary to contain a situation, its use on subjects described as mentally unstable is a cause for concern. It could worsen the mental state of the subject even further, so such situations should be handled by professionals from relevant fields, or the police should at least be accompanied by them. Such measures could be effective in de-escalating otherwise dangerous situations [5].
The bias is again evident in our analysis of who would be more likely to be stopped.
Areas that see higher rates of drug crime should also be provided with more sources of help, such as rehabilitation centres. Here too, the involvement of professionals from relevant fields could change the numbers drastically in the future.
References
[1] https://github.com/abhimotgi/dataslice/blob/master/R/Interactive%20Graphs.Rmd
[2] https://rpubs.com/hoanganhngo610/558925
[3] https://mapping-in-r.netlify.app/
[4] https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[5] https://www.joincampaignzero.org/solutions
[6] https://www.r-bloggers.com/2017/05/to-eat-or-not-to-eat-thats-the-question-measuring-the-association-between-categorical-variables/
[7] https://github.com/abhimotgi/dataslice/blob/master/R/gganimate%20code.R